This presentation is about a data science scoping framework called Project Understanding.
02/03/2022
Learning a data science scoping framework is a great way to improve as a data scientist.
“A less technical data scientist can be more effective than a more technical data scientist with the right data science scoping framework.”
A data science process includes:
The data science workflow provides the what and the collaboration framework provides the how.
This workshop is not about data science processes in general. To learn about data science processes I recommend the Data Science Process Alliance (DSPA). This organisation provides blogs and courses (e.g. Data Science Team Lead).
A data science workflow provides guidance (i.e. what tasks should be performed and how) during data science projects. Data science workflows aim to improve project:
The results of a poll conducted by the DSPA of the most popular data science processes.
Three of these data science processes are data science workflows:
The most popular data science workflow is CRISP-DM.
The CRISP-DM data science workflow:
CRISP-DM consists of 6 phases and 24 tasks, see the CRISP-DM guide for details. The 6 phases of CRISP-DM are:
These phases are represented as vertices in a directed graph where the edges represent transitions between phases.
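The directed-graph view can be sketched in a few lines of Python. This is only an illustration: the phase names follow the CRISP-DM guide, the edge list is my reading of the process diagram in that guide, and the names `CRISP_DM_TRANSITIONS` and `can_transition` are hypothetical helpers, not part of any library.

```python
# Vertices are the six CRISP-DM phases; each list holds the phases that can
# follow the key phase. The edge list is an assumption based on the process
# diagram in the CRISP-DM guide.
CRISP_DM_TRANSITIONS = {
    "Business Understanding": ["Data Understanding"],
    "Data Understanding": ["Business Understanding", "Data Preparation"],
    "Data Preparation": ["Modeling"],
    "Modeling": ["Data Preparation", "Evaluation"],
    "Evaluation": ["Business Understanding", "Deployment"],
    "Deployment": [],
}


def can_transition(source: str, target: str) -> bool:
    """Return True if the graph has an edge from `source` to `target`."""
    return target in CRISP_DM_TRANSITIONS[source]


if __name__ == "__main__":
    for phase, following in CRISP_DM_TRANSITIONS.items():
        print(f"{phase} -> {', '.join(following) or '(end)'}")
```

The back edges (for example Evaluation to Business Understanding) are what make the workflow iterative rather than a straight pipeline.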
CRISP-DM guide, p. 10
CRISP-DM guide, p. 12
CRISP-DM is over twenty years old. NICD uses a modified CRISP-DM that has updated some of the phases, tasks and outcomes.
Project Understanding is a data science scoping framework. This presentation only focuses on the first phase of the NICD workflow.
Why?
Project Understanding has four tasks:
I would like to focus on the first and third tasks.
The determine project objectives task has three outcomes:
The background is a paragraph that provides the context of the project. For example, the background for your projects would include information about global climate change and its effect on the North East.
The project objectives task is the most important task.
A project objective can be the difference between a project being a success or a failure.
So how do you determine a good project objective?
Be as specific as you can! Global climate change is a very complex problem. Break a complex problem down into simple problems and determine a simple project objective.
A project objective must be measurable so that you can tell whether the project has been a success or a failure (next outcome).
These are less important:
“In science, if you know what you are doing, you should not be doing it. In engineering, if you do not know what you are doing, you should not be doing it.” - Richard Hamming
In data science you don’t know what you are doing. So you probably don’t know if it is achievable or time-bound.
Relevance is important. All projects have stakeholders. These are the people that determine if your project objectives are relevant or not.
Think about the result of a project success. Would this satisfy the stakeholders?
Since your project objective is measurable, the project success criterion is just a threshold. If the measure is above/below the threshold, the project is a success/failure.
Determining the threshold requires the project stakeholders.
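As a rough sketch, a threshold-based success criterion could be written like this in Python. Everything here is illustrative: the class `SuccessCriterion`, the example measure and the threshold value are hypothetical, and treating a value exactly on the threshold as a success is an assumption. In practice the threshold comes from the stakeholders.

```python
from dataclasses import dataclass


@dataclass
class SuccessCriterion:
    """A measurable objective plus the threshold agreed with stakeholders."""

    measure_name: str
    threshold: float
    higher_is_better: bool = True  # False when success means staying below

    def is_success(self, measured_value: float) -> bool:
        # Assumption: a value exactly on the threshold counts as a success.
        if self.higher_is_better:
            return measured_value >= self.threshold
        return measured_value <= self.threshold


if __name__ == "__main__":
    # Hypothetical example: success means a forecast error of at most 0.10.
    criterion = SuccessCriterion(
        measure_name="forecast error (hypothetical)",
        threshold=0.10,
        higher_is_better=False,
    )
    print(criterion.is_success(0.08))
    print(criterion.is_success(0.25))
```

The `higher_is_better` flag covers the "above/below" wording: some measures succeed by exceeding the threshold, others by staying under it.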
Try producing a project objective and project success criteria for your project.
The determine data science question task has two outcomes:
A data science question is a question that requires data to answer. In the paper “What is the question?”, Jeffery Leek and Roger Peng identify six types of data science question.
Identifying the type of a question can be very hard. To make this easier we introduce some terms.
Questions about unobserved data require models. These questions are model-based questions.
It is well known that correlation does not imply causation. So it is important to know when a question requires a causal answer or not.
Model-based
Why are question types important?
Project objectives might require question types that require:
This can help identify areas of improvement.
Data science success criteria are about identifying when data science questions have been answered sufficiently. A criterion can be objective or subjective:
The person responsible for each subjective success criterion must be included.
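One way to keep track of this is to record each criterion together with its kind and, for subjective ones, the person who judges it. The sketch below is only an illustration: the class name, fields and example criterion are hypothetical and not part of the NICD workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataScienceSuccessCriterion:
    """One criterion for deciding a data science question is answered."""

    description: str
    subjective: bool
    responsible: Optional[str] = None  # who judges a subjective criterion

    def __post_init__(self) -> None:
        # A subjective criterion cannot be checked without a named judge.
        if self.subjective and not self.responsible:
            raise ValueError(
                "a subjective success criterion needs a responsible person"
            )


if __name__ == "__main__":
    # Hypothetical example of a subjective criterion with a named role.
    ok = DataScienceSuccessCriterion(
        description="The summary plots are clear enough to present",
        subjective=True,
        responsible="project stakeholder (hypothetical)",
    )
    print(ok)
```

Failing early when the responsible person is missing makes the omission visible during scoping rather than at the end of the project.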
Try producing a data science question and data science success criteria for your project.
This is the most important figure.
I am convinced that good project objectives, relevant data and good data science questions are what differentiate good and great data scientists.